
Record: SP8192 + Strict Full-Val Byte PPM Mixture — 1.00495 BPB (3-seed mean) #1850

Open
someone114514 wants to merge 1 commit into openai:main from someone114514:sp8192-strict-fullval-ppm-0426

Conversation

@someone114514

Summary

3-seed mean val_bpb 1.00495 (std 0.00072). Best/min seed is 1.00425333 BPB (seed 1337). Compared to the merged 2026-04-09 SP8192 legal TTT record at 1.0810 BPB, this improves by 0.0761 BPB, comfortably past the 0.005-nat threshold and over 100x the observed inter-seed std. All three artifacts stay under the 16 MB cap.

The submission adds one scoring component on top of the existing SP8192 training stack: a binary-lambda-gated PPM-D byte-level mixture applied to the sliding-window NN log-probs at eval time. The mixture is constructed to fit the score-before-update discipline: each byte is scored from the prefix PPM state, then inserted into the PPM counts for future bytes.

| metric | value |
| --- | --- |
| val_bpb (PPM mixture, 3-seed mean) | 1.00495 |
| std across seeds | 0.00072 |
| best/min seed | 1.00425333 |
| improvement vs base legal TTT (1.0810) | 0.0761 BPB |
| training | 8xH100 SXM, 600s cap, ITERATIONS=20000 |
| eval | sliding-window stride=64 + strict full-val byte PPM mixture, under 600s |
| total_submission_bytes_max | 15,997,433 |
| cap margin at max artifact | 2,567 B |
| seeds run | 42, 7, 1337 |

The Contribution

A binary-lambda-gated PPM-D mixture over an already-scored byte stream, computed at eval time and combined with the NN's per-byte probabilities (exponentiated from its log-probs) in probability space.

For each predicted byte at position t:

  1. NN probability: the per-token NN NLL from the existing causal sliding-window evaluation is deterministically spread uniformly over the bytes emitted by that target token. This uses the already-computed sliding NLLs; there are no extra NN forward passes.
  2. PPM probability: classical byte-level PPM-D style scoring over the 256-byte alphabet. Counts are built online from already-scored validation bytes only. No future bytes are read.
  3. Gate: the binary mixture lambda is selected from prefix context confidence before observing the current byte. If the deepest available context is highly confident, the gate picks lambda_lo=0.05 (mostly trusting PPM); otherwise it picks lambda_hi=0.9 (mostly trusting the NN).
  4. Mix: p_mix = lambda * p_NN + (1 - lambda) * p_PPM, then -log(p_mix) contributes to byte BPB.
  5. Update: PPM counts are incremented only after the byte's mixed log-probability is recorded.

The implementation uses PPM_ORDER=4, PPM_LAMBDA_HI=0.9, PPM_LAMBDA_LO=0.05, and PPM_CONF_THRESHOLD=0.9 in the submitted logs.
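
For concreteness, a minimal self-contained Python sketch of the score-before-update loop and the binary gate follows. It is only an illustration of the discipline described above, not the submitted scorer (which is native C); a flat add-one-smoothed order-0 count table stands in for the real order-4 PPM-D state, and the confidence rule is a simplified stand-in.

```python
import math
from collections import Counter

# Gate parameters as reported in the submitted logs.
LAMBDA_HI, LAMBDA_LO, CONF_THRESHOLD = 0.9, 0.05, 0.9

def score_stream(byte_stream, nn_byte_probs):
    """Score each byte from the prefix state, then update (score-before-update).

    nn_byte_probs[t] is the NN probability assigned to byte t, obtained by
    spreading each token's NLL uniformly over its bytes: p = exp(-token_nll / n_bytes).
    A flat order-0 count table stands in for the real order-4 PPM-D state.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for t, b in enumerate(byte_stream):
        # PPM-side probability from already-scored bytes only (add-one smoothed here).
        p_ppm = (counts[b] + 1.0) / (seen + 256.0)
        # Prefix-only confidence: mass of the most frequent continuation so far.
        conf = (max(counts.values()) / seen) if seen else 0.0
        lam = LAMBDA_LO if conf >= CONF_THRESHOLD else LAMBDA_HI
        p_mix = lam * nn_byte_probs[t] + (1.0 - lam) * p_ppm
        total_bits += -math.log2(p_mix)          # score first ...
        counts[b] += 1                           # ... update only afterwards
        seen += 1
    return total_bits / max(len(byte_stream), 1)  # bits per byte

# Toy usage: a repetitive byte stream where the count model quickly gains confidence.
data = b"abcabcabcabc"
nn_probs = [0.02] * len(data)                    # stand-in NN per-byte probabilities
print(score_stream(data, nn_probs))
```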

Why this helps here: the parameter-constrained SP8192 NN still has a byte-level surprisal floor on highly repetitive local byte contexts such as identifiers, URLs, numeric literals, and repeated formatting fragments. PPM is strong exactly in those high-confidence local contexts. The binary gate is intentionally conservative: it trusts PPM only when the prefix counts indicate a strong local continuation, and otherwise falls back toward the NN.

Per-Seed Results

| Seed | Pre-Quant / Post-EMA | PPM mix | Artifact (B) |
| --- | --- | --- | --- |
| 1337 | 1.08627037 | 1.00425333 | 15,993,603 |
| 42 | 1.08711004 | 1.00489563 | 15,997,433 |
| 7 | 1.08750246 | 1.00569239 | 15,995,226 |
| Mean | 1.08696 | 1.00495 | 15,995,421 |
| Std | | 0.00072 | |

Three independent seeds, all with ppm_mix < 1.006. The headline number is the PPM mixture returned as quantized_sliding_window val_bpb. The logs also report nn_token_bpb, nn_byte_bpb, and ppm_only for auditability.

Legality / Issue #1017

The PPM mixture is implemented inside a strict score-before-update eval-time path.

| Condition | How this submission satisfies it |
| --- | --- |
| 1. Causality | Sliding-window NN scoring is strictly causal: each token is scored from prefix tokens only. PPM context is the byte prefix of already-scored bytes, never future bytes. |
| 2. Normalized distribution | PPM-D produces a normalized distribution over the 256-byte alphabet through its escape mechanism. The final mixture is in probability space, so it is normalized by construction. The NN side remains the standard softmax over the full vocab. |
| 3. Score before update | NN sliding scores are computed before PPM bytes are updated. Each byte is scored from existing PPM counts and only then inserted into the count tables. |
| 4. Single pass | Each validation byte is scored exactly once, in stream order. There is no rescoring, no multi-pass selection, and no prebuilt validation cache. |
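
A small numeric check of condition 2 can be done with a single-level PPM-D escape blend (no exclusions, for simplicity; this is an illustrative helper, not the submitted scorer): the escape-weighted blend of the seen-symbol mass with any normalized lower-order distribution sums to 1, and a convex combination of two normalized distributions stays normalized.

```python
from collections import Counter

def ppm_d_dist(counts, lower):
    """One level of a PPM-D blend over 256 bytes (no exclusions, for simplicity).

    counts : Counter of byte -> count for the current context
    lower  : list of 256 probabilities that already sums to 1
    """
    n = sum(counts.values())
    if n == 0:
        return list(lower)                       # escape all the way down
    u = len(counts)                              # distinct bytes seen in this context
    p_escape = u / (2.0 * n)                     # PPM-D escape probability
    dist = [p_escape * lower[b] for b in range(256)]
    for b, c in counts.items():
        dist[b] += (c - 0.5) / n                 # discounted seen-symbol mass
    return dist

# Toy check: PPM-D blend is normalized, and so is the lambda mixture with the NN.
order0 = [1.0 / 256] * 256
ctx_counts = Counter({104: 5, 101: 3, 32: 2})    # counts for bytes 'h', 'e', ' '
p_ppm = ppm_d_dist(ctx_counts, order0)
p_nn = [1.0 / 256] * 256                         # stand-in for the NN byte distribution
lam = 0.05
p_mix = [lam * a + (1 - lam) * b for a, b in zip(p_nn, p_ppm)]
assert abs(sum(p_ppm) - 1.0) < 1e-9
assert abs(sum(p_mix) - 1.0) < 1e-9
```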

Additionally:

  • No SLOT.
  • No TTT in the packed artifact.
  • No pre-quant validation adaptation.
  • No ETLB / logit bias.
  • No n-gram cache.
  • No external network access at eval time.
  • PPM state is built fresh inside eval_val_sliding for each run and is not persisted across invocations.
  • Tokenizer is unchanged from the base SP8192 stack.

Implementation Notes

The scorer is native C compiled at runtime with gcc -O3 from the packed script. It uses:

  • open-addressed context tables,
  • rolling byte context keys,
  • inline counts for the first four bytes per context,
  • fixed order-0 byte counts,
  • cached integer logs,
  • precomputed lambda logs,
  • compact raw per-rank token/NLL files in /tmp for distributed sliding collection.
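
The submitted scorer is native C; the Python sketch below only illustrates the rolling-context-key and open-addressed-lookup pattern. The table size, key folding, and dict-per-slot counts are illustrative stand-ins, not the submitted layout.

```python
TABLE_SIZE = 1 << 16          # power of two so probing can use a bitmask

class ContextTable:
    """Open-addressed table keyed by a rolling hash of the recent byte context.

    Illustrative only: no resizing, so it assumes the table never fills.
    """

    def __init__(self):
        self.keys = [None] * TABLE_SIZE
        self.counts = [None] * TABLE_SIZE        # per-context byte count dicts

    def _slot(self, key):
        i = key & (TABLE_SIZE - 1)
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) & (TABLE_SIZE - 1)       # linear probing
        return i

    def get(self, key):
        """Return the count dict for this context, or None if unseen."""
        return self.counts[self._slot(key)]

    def bump(self, key, byte):
        """Insert the context if new and increment the count for `byte`."""
        i = self._slot(key)
        if self.keys[i] is None:
            self.keys[i] = key
            self.counts[i] = {}
        self.counts[i][byte] = self.counts[i].get(byte, 0) + 1

def rolling_key(prev_key, byte, order_mask=(1 << 32) - 1):
    """Fold the newest byte into the context key; the 32-bit mask keeps the
    last four bytes, matching an order-4 context (illustrative choice)."""
    return ((prev_key << 8) | byte) & order_mask

# Usage: look up the prefix state (score step), update afterwards, then roll the key.
table = ContextTable()
key = 0
for b in b"the the the":
    ctx_counts = table.get(key)                  # prefix-only lookup for scoring
    table.bump(key, b)                           # update only after scoring
    key = rolling_key(key, b)
```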

The Python PPM reference and eval-time TTT were removed from the packed artifact to keep the submission under the 16 MB cap. Native exactness was checked against the Python reference during development before trimming.

Compliance Numbers

| item | value |
| --- | --- |
| max final_model.int6.ptz | 15,976,001 B |
| packed train_gpt.py | 21,432 B |
| max total submission | 15,997,433 B |
| cap margin | 2,567 B |

All three seeds are under 16,000,000 bytes.

Files

  • records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/train_gpt.py
  • records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/submission.json
  • records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/README.md
  • records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/train_seed1337.log
  • records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/train_seed42.log
  • records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/train_seed7.log

Reproduce

```bash
python3 -m pip install brotli sentencepiece
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

RUN_ID=strict_ppm_trim_seed42_8gpu_order4_b32 \
SEED=42 \
PPM_ENABLED=1 \
PPM_NATIVE_ENABLED=1 \
PPM_ORDER=4 \
PPM_LAMBDA_HI=0.9 \
PPM_LAMBDA_LO=0.05 \
PPM_CONF_THRESHOLD=0.9 \
PPM_LOG_CACHE_SIZE=1048576 \
SKIP_QUANTIZED_EVAL=1 \
SLIDING_BATCH_SEQS=32 \
torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/train_gpt.py
```

Change SEED and RUN_ID for seeds 7 and 1337.

@someone114514 force-pushed the sp8192-strict-fullval-ppm-0426 branch from 304dff5 to 37ce906 on April 27, 2026 06:29
phaniratan1234 pushed a commit to phaniratan1234/parameter-golf that referenced this pull request Apr 27, 2026
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 27, 2026
… required; PR openai#1848 BPB risk; Day 18 plateau; Session 23

- Merged SOTA still 1.0810 (Day 18, no change since Apr 9)
- PPM-D byte mixture confirmed by dexhunter at 1.0322 (PR openai#1857, self-closed)
- SmearGate BOS bug documented: prev-token leaks at document boundaries; fix required
- PR openai#1848 (newjordan, 0.87980) flagged BPB risk: sibling PR openai#1846 closed same day
- PR openai#1858 (0.9946) only covers 8M/40.5M tokens — not leaderboard-comparable
- PR openai#1855 (codemath3000, 1.06108) and openai#1851 (aquariouseworkman, 1.06128) both clean
- PPM-D wave: PRs openai#1850, openai#1854, openai#1835 await organizer ruling
- Added Session 23 lessons to CLAUDE.md
- 3 days to deadline (Apr 30) — final GPU run window

https://claude.ai/code/session_01RmJtLYUmKNzDgDVTnWoKzU
@someone114514 force-pushed the sp8192-strict-fullval-ppm-0426 branch from 58bed14 to 37ce906 on April 27, 2026 23:40
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
… notes

spec 052: PPM-D byte mixture port from PR openai#1850 onto 047B + our anti-hijack
gate tuning. Phase 1 measured end-to-end at mix_bpb_sidecar = 1.00506,
matching PR openai#1850's 1.00495 within 0.0001.

spec 055: full submission run — train 050 baseline from scratch, apply same
tuned PPM at eval. Single train_gpt.py file. Predicts 1.005 +/- 0.003.
Code: exp/055-050-with-ppm-fullrun @ c27be23.

ideas:
- ppm-port-on-047B.md — narrative of the PPM port discovery, headroom
  analysis (1850 vs 1857 vs us), and why anti-hijack was the bigger lever.
- ppm-d-mixture-and-anti-hijack.md — full math: per-token NN -> per-byte
  spreading, PPM-D Howard escape-D, the gate (1850 raw + anti-hijack
  override), log-sum-exp mixture, and the 4.32-bit hijack geometry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
Earlier default 4194304 (OMP-chunked) was suboptimal — saves ~230s eval time
but loses ~0.010 BPB sidecar from chunk-reset penalty. PR openai#1850 chose single-
pass deliberately and pays the 252s scoring cost for the bigger gain.

Single-pass timing on 8H per 1850's measurements:
  pre-quant + gptq + ema:      ~85s
  diagnostic quantized eval:   ~60s
  non-overlap forward (8-way): ~20s
  file gather:                 ~5s
  single-pass PPM scoring:     ~250s  (CPU-bound, not GPU)
  ────────────────────────────────────
  total eval phase:            ~420s   under 600s cap

Smokes (where wallclock matters more than gain) can override with
PPM_OMP_CHUNK_TOKENS=4194304.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 28, 2026
WHY: V1 NGramMixer (fixed-order bigram + Dirichlet uniform smoothing) failed
because cold-start q_bi was uniform → mixing in noise. V2 (TempScaler) failed
because the trained NN is already calibrated. The actual large entropy gap
that PPM byte mixture exploits is *local verbatim repetition* (URLs, code
identifiers, repeated phrases) that a 5M-param NN averages over.

WHAT: Cleary-Witten 1984 PPM-D over the SP token alphabet (Σ_token=8192),
with backoff via escape mechanism. Distribution defined on Σ_token resolves
the byte-vs-token C2 dispute (Issue openai#1872) cleanly. Binary λ gate (PR openai#1850
pattern): if PPM confidence at deepest matched context ≥ threshold,
λ=lambda_lo (mostly trust PPM); else λ=lambda_hi (mostly trust NN).

LEGALITY: All four conditions of Issue openai#1017:
  C1: ctx[k] only contains counts from already-scored tokens
  C2: P_K(·|prev) = recursive PPM-D blend, sums to 1 over Σ_token
      (verified by `test_ppm_c2_full_normalized`); convex combination with
      NN softmax preserves normalization
  C3: λ-gate uses confidence at deepest matched context (prev-only),
      computed before observing target. update_stream is called AFTER mix_nll
  C4: monotonic state, single left-to-right pass

VALIDATION: 23/23 unit tests pass on CPU including a functional toy
benchmark — on a chunked synthetic stream with strong repetition motifs, PPM
gives -3.2 nats/token improvement vs NN baseline. (Real FineWeb is much less
repetitive but the byte-level PPM cluster has shown -0.05 to -0.20 BPB
improvements on this challenge, suggesting token-level can capture similar
entropy.)

INTEGRATION: eval_val sub-chunked W=128 (env: PPM_CHUNK_TOKENS) so within-
batch repetition is captured. State carries across batches via the mixer
object. Eval_val_ttt_phased path NOT touched yet (would need per-doc-slot
PPM tables; deferred to V4 if V3 numbers warrant).

ENV: PPM_MIX_ENABLED, PPM_MAX_ORDER (default 2), PPM_LAMBDA_LO (0.05),
PPM_LAMBDA_HI (0.9), PPM_CONF_THRESHOLD (0.9), PPM_CHUNK_TOKENS (128).